ParD antitoxin empirical landscape

In this tutorial we analyze the topography of a landscape of 3-amino-acid sequences, based on protein-protein interaction data between mutant ParD3 sequences and the two cognate toxins ParE2 and ParE3.

1. Obtaining estimates for every possible sequence from incomplete experimental data

In contrast to theoretical or toy landscapes, empirical landscapes based on experimental data are noisy and incomplete, so we need to apply additional methods to obtain estimates for every genotype of interest before we can visualize the structure of the fitness landscape.

We provide a simple interface to perform Variance Component (VC) regression. Briefly, VC regression uses the empirical data to estimate a prior for Gaussian process regression on functions over sequence space, characterized by how much each order of interaction contributes to the sequence-dependent variability in the data. We can then estimate and predict the phenotype of every possible sequence, under the assumption that the prior remains valid in regions of sequence space that have not been experimentally measured.
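In sketch form (generic notation for the standard variance-component setup, not the library's API): the prior is a Gaussian process whose covariance between two sequences depends only on the Hamming distance between them,

```latex
K(x, x') = \sum_{k=0}^{\ell} \lambda_k \, W_k\!\left(d(x, x')\right),
```

where $\ell$ is the sequence length, $\lambda_k \geq 0$ measures the variance contributed by interactions of order $k$, $d(x, x')$ is the Hamming distance, and the $W_k$ are fixed basis functions of the distance. Estimating the $\lambda_k$ from the data defines the prior that is then used for prediction at unmeasured genotypes.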

Preparing the data

[1]:
# Import required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import holoviews as hv

from os.path import join

import gpmap.src.plot as plot

from gpmap.src.inference import VCregression
from gpmap.src.space import SequenceSpace
from gpmap.src.randwalk import WMWSWalk
from gpmap.src.genotypes import select_genotypes
from gpmap.src.seq import translate_seqs
from gpmap.src.settings import TEST_DATA_DIR
[2]:
fpath = join(TEST_DATA_DIR, 'GSE153897_Variant_frequency.csv')
data = pd.read_csv(fpath, index_col=0)
data.head()
[2]:
ParE3_rep1_t0 ParE2_rep1_t0 ParE3_rep2_t0 ParE2_rep2_t0 ParE3_rep1_t600 ParE2_rep1_t600 ParE3_rep2_t600 ParE2_rep2_t600
AAA 0.000091 0.000075 0.000086 0.000076 0.000110 0.000003 0.000100 4.850000e-06
AAC 0.000057 0.000065 0.000067 0.000068 0.000032 0.000001 0.000032 2.100000e-06
AAD 0.000049 0.000034 0.000041 0.000041 0.000025 0.000001 0.000019 6.470000e-07
AAE 0.000057 0.000052 0.000054 0.000049 0.000277 0.000002 0.000298 1.130000e-06
AAF 0.000133 0.000125 0.000133 0.000121 0.000158 0.000007 0.000151 7.760000e-06
[3]:
# Add a pseudocount (the smallest strictly positive frequency in each column)
# so that log-ratios are defined for variants with zero counts
data = data + data[data>0].min(0)
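This line adds, to each column, the smallest strictly positive frequency observed in that column, so that variants with zero counts do not produce undefined log-ratios later. A minimal illustration of the same expression on toy data (the values here are made up):

```python
import pandas as pd

# Toy frequency table with some zero entries
toy = pd.DataFrame({'a': [0.0, 0.2, 0.1],
                    'b': [0.5, 0.0, 0.25]})

# data[data > 0] masks non-positive entries to NaN, so .min(0) takes the
# column-wise minimum over the strictly positive values (a -> 0.1, b -> 0.25),
# which is then added to every entry of that column
toy = toy + toy[toy > 0].min(0)
```

After this step every entry is strictly positive, so downstream `log2` calls are safe.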
[4]:
parE2_cols = ['ParE2_rep1', 'ParE2_rep2']
parE3_cols = ['ParE3_rep1', 'ParE3_rep2']
cols = parE2_cols + parE3_cols

logfitness = pd.DataFrame({col: np.log2(data['{}_t600'.format(col)] / data['{}_t0'.format(col)])
                           for col in cols}).dropna()

logfitness['parE2_mean'] = logfitness[parE2_cols].mean(1)
logfitness['parE2_var'] = logfitness[parE2_cols].var(1)
logfitness['parE3_mean'] = logfitness[parE3_cols].mean(1)
logfitness['parE3_var'] = logfitness[parE3_cols].var(1)

# Separate sequences containing stop codons ('*') from the rest
stops = np.array(['*' in x for x in logfitness.index])
stops_logf = logfitness.loc[stops, :]
logfitness = logfitness.loc[~stops, :]

logfitness.head()
[4]:
ParE2_rep1 ParE2_rep2 ParE3_rep1 ParE3_rep2 parE2_mean parE2_var parE3_mean parE3_var
AAA -4.927789 -4.196607 0.058173 -0.031619 -4.562198 0.267314 0.013277 0.004031
AAC -5.659228 -5.202646 -1.186775 -1.356459 -5.430937 0.104234 -1.271617 0.014396
AAD -5.169573 -6.120851 -1.351620 -1.562352 -5.645212 0.452465 -1.456986 0.022204
AAE -5.206682 -5.643856 1.936771 2.074769 -5.425269 0.095560 2.005770 0.009522
AAF -4.317164 -4.109532 0.085886 0.017124 -4.213348 0.021555 0.051505 0.002364
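To make the score explicit: each value above is the log2 ratio of a variant's frequency after selection (t600) to its frequency before selection (t0), computed per replicate and then averaged. A minimal illustration with hypothetical frequencies:

```python
import numpy as np
import pandas as pd

# Hypothetical before/after selection frequencies for two variants
toy = pd.DataFrame({'t0':   [1e-4, 2e-4],
                    't600': [4e-4, 1e-4]},
                   index=['AAA', 'AAC'])

# log2 fold-change in frequency: positive = enriched, negative = depleted
logf = np.log2(toy['t600'] / toy['t0'])
# AAA: log2(4) = 2.0 (enriched); AAC: log2(0.5) = -1.0 (depleted)
```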

Run VC regression

[5]:
vc = VCregression()
[6]:
pred = {}
lambdas = {}
for parE in ['parE2', 'parE3']:
    vc.fit(X=logfitness.index, y=logfitness['{}_mean'.format(parE)],
           variance=logfitness['{}_var'.format(parE)], cross_validation=True)
    lambdas[parE] = vc.lambdas
    pred[parE] = vc.predict()['function']

pred = pd.DataFrame(pred, index=vc.genotypes)
pred.head()
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:01<00:00,  6.30it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:01<00:00,  6.08it/s]
[6]:
parE2 parE3
AAA -4.406670 -0.005151
AAC -5.272305 -1.287416
AAD -5.294139 -1.590561
AAE -5.439601 2.013529
AAF -4.239208 0.039020
[7]:
# Store for later use
pred.to_csv('../data/parD_parE.csv')
[8]:
fig, subplots = plot.init_fig(1, 3, colsize=4, rowsize=4)

# Plot variance components
axes = subplots[0]
axes2 = subplots[1]

for c, (label, l) in zip(['purple', 'orange'], lambdas.items()):
    x = np.arange(1, l.shape[0] + 1)
    axes.scatter(x, l, c=c, label=label)
    axes.plot(x, l, c=c)

    variance = vc.lambdas_to_variance(l)
    p_variance = variance / variance.sum() * 100

    axes2.scatter(x[:-1], p_variance, c=c, label=label)
    axes2.plot(x[:-1], p_variance, c=c)

axes.set(xlabel='$k$', ylabel=r'$\lambda_k$', yscale='log')
axes2.set(xlabel='$k$', ylabel='% variance explained', ylim=(0, 100), xticks=[1, 2, 3])
axes.legend(loc=1)

# Plot estimates vs observations
axes = subplots[2]
sns.histplot(x=pred.loc[logfitness.index, 'parE3'],
             y=logfitness['parE3_mean'], cmap='Blues', ax=axes)
axes.set(xlabel='VC log(Fitness)', ylabel='Average observed log(Fitness)', title='parE3')
fig.tight_layout()
../_images/usage_3_ParD_10_0.png

We can see that both landscapes are largely additive: the additive component explains nearly 100% of the variability in the data, and there is generally a very good correspondence between estimates and experimental measurements.
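The percentages plotted above come from weighting each lambda_k by the number of interaction components at that order. In the standard convention (a sketch only; `vc.lambdas_to_variance` may handle details such as the constant k = 0 term differently), a length-ell sequence with a alleles per site has m_k = C(ell, k) * (a - 1)^k components of order k, so the variance contributed at order k is lambda_k * m_k:

```python
from math import comb

import numpy as np

ell, a = 3, 20  # sequence length and alleles per site, as in this tutorial
m = np.array([comb(ell, k) * (a - 1) ** k for k in range(1, ell + 1)])
# m = [57, 1083, 6859]: orders 1..3 for a 3-site, 20-allele space

lambdas = np.array([1.0, 0.01, 0.001])   # hypothetical eigenvalues
variance = lambdas * m                   # per-order variance contributions
pct = 100 * variance / variance.sum()    # % variance explained per order
```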

Compare estimated specificities

[9]:
fig, axes = plot.init_fig(1, 1, colsize=4, rowsize=4)

sns.histplot(x=pred['parE3'], y=pred['parE2'], cmap='Blues', ax=axes)
axes.set(xlabel='Fitness with E3', ylabel='Fitness with E2')

[9]:
[Text(0.5, 0, 'Fitness with E3'), Text(0, 0.5, 'Fitness with E2')]
../_images/usage_3_ParD_12_1.png

2. Visualizing the protein landscape

Now that we have generated estimates of the function for every possible 3-amino-acid sequence, we can build our discrete sequence space and calculate the embedding under the specified evolutionary model.
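For reference, the size of this space is easy to check: with 20 alleles at each of 3 sites there are 20**3 = 8000 protein genotypes. A minimal enumeration, independent of the library:

```python
from itertools import product

alphabet = 'ACDEFGHIKLMNPQRSTVWY'  # the 20 canonical amino acids
genotypes = [''.join(s) for s in product(alphabet, repeat=3)]
# 8000 genotypes, from 'AAA' to 'YYY', matching the index of `pred`
```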

[10]:
space = SequenceSpace(X=pred.index.values, y=pred['parE2'].values)
print(space)
Sequence Space:
        Type: protein
        Sequence length: 3
        Number of alleles per site: [20, 20, 20]
        Genotypes: [AAA,AAC,AAD,...,YYV,YYW,YYY]
        Function y: [-4.41,-5.27,-5.29,...,-5.79,-5.94,-5.82]
[11]:
rw = WMWSWalk(space)
[12]:
plot.figure_Ns_grid(rw, nodes_color='function', ncol=3, nrow=3,
                    fmin=-2.5, fmax=2.5, nodes_size=5, edges_alpha=0.02)
../_images/usage_3_ParD_16_0.png
[13]:
rw.calc_visualization(mean_function=2, n_components=20)
print('Mean function at stationarity of 2 was reached at Ns={:.2f}'.format(rw.Ns))
Mean function at stationarity of 2 was reached at Ns=0.73
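Conceptually, this calibration can be sketched as follows (a toy illustration of the idea, not the library's internals, and the exact Ns scaling convention may differ): under weak mutation the stationary distribution weights each genotype by exp(Ns * f), and since the stationary mean function increases monotonically with Ns, we can solve for the Ns that achieves a target mean by bisection:

```python
import numpy as np

def stationary_mean(f, Ns):
    """Mean of f under weights proportional to exp(Ns * f)."""
    w = np.exp(Ns * (f - f.max()))  # subtract max for numerical stability
    p = w / w.sum()
    return float(p @ f)

def calibrate_Ns(f, target, lo=0.0, hi=100.0):
    """Bisection for the Ns whose stationary mean equals `target`."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if stationary_mean(f, mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

f = np.array([-2.0, 0.0, 1.0, 3.0])  # hypothetical fitness values
Ns = calibrate_Ns(f, target=2.0)
```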
[14]:
nodes_df = rw.nodes_df
edges_df = rw.space.get_edges_df()
[15]:
fig, axes = plot.init_fig(1, 1)
plot.plot_relaxation_times(rw.decay_rates_df, axes=axes)
../_images/usage_3_ParD_19_0.png
[16]:
fig, subplots = plot.init_fig(2, 2, colsize=5, rowsize=5)


axes = subplots[0][0]
plot.plot_visualization(axes, nodes_df, nodes_color='function', edges_df=edges_df, cbar=False,
                        nodes_size=6, edges_alpha=0.01, sort_by='function', ascending=True)
axes.set(xlabel='', xticks=[])

axes = subplots[0][1]
plot.plot_visualization(axes, nodes_df, nodes_color='function', edges_df=edges_df, cbar=False, x='3',
                        nodes_size=6, edges_alpha=0.01, sort_by='function', ascending=True)
axes.set(ylabel='', yticks=[])

axes = subplots[1][0]
plot.plot_visualization(axes, nodes_df, nodes_color='function', edges_df=edges_df, cbar=False, y='3',
                        nodes_size=6, edges_alpha=0.01, sort_by='function', ascending=True)

plot.empty_axes(subplots[1][1])

fig.tight_layout()


../_images/usage_3_ParD_20_0.png

Allele grid

[17]:
plot.figure_allele_grid_datashader(nodes_df, fpath='parE2.alleles', edges_df=edges_df, x='1', y='2',
                                   nodes_resolution=500)

Interactive 3D visualization

At this scale, rendering the edges in an interactive 3D plot becomes too slow to be useful, so we plot only the nodes for a more responsive experience.

[18]:
plot.plot_interactive(nodes_df, z='3', nodes_color='function', nodes_size=2)

Alternatively, we can select only the high-fitness genotypes and plot them, using the coordinates already calculated, as a reduced landscape.

[19]:
genotypes = (nodes_df['function'] > 1).values
fnodes_df, fedges_df = select_genotypes(nodes_df, genotypes=genotypes, edges=edges_df)
[20]:
plot.plot_interactive(fnodes_df, edges_df=fedges_df, z='3', nodes_color='function', nodes_size=4)

Exploring the 3D landscape gives us a qualitative understanding of its structure. The nodes are arranged in a tetrahedron-like geometry. At the base, we have 3 broad fitness peaks characterized by having I, L or R at position 2, while the other positions can vary freely. However, if we also have K at position 3 (XIK, XLK, XRK), we move up towards the tip of the tetrahedron as fitness increases. Once K is fixed at position 3, we move to the tip by mutating position 2 to V (XVK) or M (XMK) directly, or through the less fit intermediates XMX, XXK and XVX.

This may be compatible with a simple additive model, as reported in the paper and suggested by our VC analysis. While the highest-fitness sequences are located at XIK, XLK and XRK, more mutations are still required to go from the XIX, XLX and XRX ends to the XVK end, which is what drives the main divergence axis, together with the symmetry among the different amino acids allowed at position 2.

If we look at the relaxation times, we can see that diffusion axes 4 and 5 still contribute on a scale similar to diffusion axis 3. Can we see whether XMX and XVX form similar structures along diffusion axes 4 and 5?

[21]:
plot.plot_interactive(fnodes_df, edges_df=fedges_df, y='4', z='5', nodes_color='function', nodes_size=4)

Indeed, we now find clusters corresponding to both XMX and XVX, with their corresponding subsequences XMK and XVK showing improved fitness, further supporting the additive model. It may simply be that some alleles, and thus their combinations, are slightly better than others, so they contribute slightly more evolutionary distance and appear along the first diffusion axes.

The landscape may be easily summarized by an additive model with positive contributions from the following alleles, where the first position does not matter much:

- Position 2: I, L, R, M, V
- Position 3: K
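As a sanity check on this kind of interpretation, an additive model can be fit directly by one-hot linear regression. The sketch below is self-contained and uses a toy additive landscape with a reduced alphabet (not the actual ParD predictions), where the fit recovers the generating model exactly:

```python
from itertools import product

import numpy as np

rng = np.random.default_rng(0)
alphabet = 'ACDE'          # reduced 4-letter alphabet for a small toy example
L, A = 3, len(alphabet)
seqs = [''.join(s) for s in product(alphabet, repeat=L)]
aa_index = {c: i for i, c in enumerate(alphabet)}

# Toy additive ground truth: independent per-site, per-allele effects
effects = rng.normal(size=(L, A))
y = np.array([sum(effects[p, aa_index[c]] for p, c in enumerate(s)) for s in seqs])

# One-hot design matrix and least-squares fit of the additive model
# (X is rank-deficient, so lstsq returns the minimum-norm solution)
X = np.zeros((len(seqs), L * A))
for r, s in enumerate(seqs):
    for p, c in enumerate(s):
        X[r, p * A + aa_index[c]] = 1.0
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

r2 = 1 - np.sum((X @ coef - y) ** 2) / np.sum((y - y.mean()) ** 2)
```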

TODO: fill in the details about the actual positions and have some insights into the structural requirements

3. Analyzing evolution using a codon model

While the evolutionary model in protein space helps us understand the overall structure of the landscape from a qualitative point of view, evolution does not actually take place on the protein sequence directly, but on the DNA sequence encoding it. Some amino-acid changes are not allowed by the genetic code, while others happen much more frequently. To take these features into account, we can model evolution on the nucleotide sequence using the fitnesses of the corresponding protein sequences that we previously inferred.
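The constraints imposed by the genetic code are easy to inspect directly. The sketch below (self-contained, independent of the library) builds the standard codon table and lists which amino acids are reachable from a given codon by a single nucleotide change:

```python
# The standard genetic code written compactly: codons enumerated in TCAG
# order at each position, with '*' denoting stop codons
bases = 'TCAG'
aas = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
codon_table = {a + b + c: aas[16 * i + 4 * j + k]
               for i, a in enumerate(bases)
               for j, b in enumerate(bases)
               for k, c in enumerate(bases)}

def aa_neighbors(codon):
    """Amino acids (or '*') reachable from `codon` by one nucleotide change."""
    out = set()
    for pos in range(3):
        for b in bases:
            if b != codon[pos]:
                out.add(codon_table[codon[:pos] + b + codon[pos + 1:]])
    return out

# From ATG (Met), only I, L, V, K, T and R are one mutation away;
# e.g. M -> P requires at least two nucleotide changes
reachable = aa_neighbors('ATG')
```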

[22]:
space = SequenceSpace(seq_length=3, function=pred['parE2'],
                      use_codon_model=True, codon_table='Standard',
                      stop_function=stops_logf['parE2_mean'].mean())
codon_rw = WMWSWalk(space)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_436376/3842282793.py in <module>
      1 space = SequenceSpace(seq_length=3, function=pred['parE2'],
      2                       use_codon_model=True, codon_table='Standard',
----> 3                       stop_function=stops_logf['parE2_mean'].mean())
      4 codon_rw = WMWSWalk(space)

TypeError: __init__() got an unexpected keyword argument 'function'
[ ]:
plot.figure_Ns_grid(codon_rw, nodes_color='function', ncol=3, nrow=3, show_edges=True,
                    fmin=-2, fmax=3, nodes_size=2, edges_alpha=0.01, ascending=True)
[ ]:
codon_rw.calc_visualization(mean_function=2, n_components=20)
nodes_df = codon_rw.nodes_df
nodes_df['protein'] = translate_seqs(nodes_df.index, codon_table='Standard')
edges_df = codon_rw.space.get_edges_df()
[ ]:
nodes_df.head()
[ ]:
fig, axes = plot.init_fig(1, 1, colsize=5, rowsize=5)
plot.plot_relaxation_times(codon_rw.decay_rates_df, axes=axes)
[ ]:
fig, subplots = plot.init_fig(3, 4, colsize=5, rowsize=5)

# Toggle edge rendering: set edf = edges_df to draw edges (much slower)
edf = None

axes = subplots[1][1]
plot.plot_visualization(axes, nodes_df, cbar=False, nodes_color='function', edges_df=edf,
                        nodes_size=2, edges_alpha=0.01, edges_color='lightgrey',
                        sort_by='3', ascending=True)

axes = subplots[1][2]
plot.plot_visualization(axes, nodes_df, x='3', cbar=False, nodes_color='function', edges_df=edf,
                        nodes_size=2, edges_alpha=0.01, edges_color='lightgrey',
                        sort_by='1', ascending=True)

axes = subplots[2][1]
plot.plot_visualization(axes, nodes_df, x='1', y='3', cbar=False, nodes_color='function', edges_df=edf,
                        nodes_size=2, edges_alpha=0.01, edges_color='lightgrey',
                        sort_by='2', ascending=True)

axes = subplots[0][1]
plot.plot_visualization(axes, nodes_df, x='1', y='3', cbar=False, nodes_color='function', edges_df=edf,
                        nodes_size=2, edges_alpha=0.01, edges_color='lightgrey',
                        sort_by='2', ascending=False)

axes = subplots[1][0]
plot.plot_visualization(axes, nodes_df, x='3', y='2', cbar=False, nodes_color='function', edges_df=edf,
                        nodes_size=2, edges_alpha=0.01, edges_color='lightgrey',
                        sort_by='2', ascending=False)

axes = subplots[1][3]
plot.plot_visualization(axes, nodes_df, cbar=False, nodes_color='function', edges_df=edf,
                        nodes_size=2, edges_alpha=0.01, edges_color='lightgrey',
                        sort_by='3', ascending=False)



for axes in subplots.flatten():
    plot.empty_axes(axes)
    axes.set(xlabel='', ylabel='')
fig.tight_layout()


[ ]:
plot.savefig(fig, 'parD.E2.codon.full')

Now we can start to see how the rendering gets slower, especially as we require all the edges to be shown. To speed this up, we provide functions that make the same plots using the datashader library.

[ ]:
fig = hv.render(plot.plot_holoview(nodes_df, edges_df=edges_df, nodes_resolution=400, edges_resolution=1200))
fig

Interactive 3D plot

To get a better idea of the overall 3D structure of the landscape, it is usually helpful to see it from different perspectives, so the interactive 3D plot can be useful. Again, we need to limit the plotting to high-fitness sequences to be able to interact with it smoothly.

[ ]:
genotypes = (nodes_df['function'] > 2.5).values
print('Number of remaining genotypes after filtering: {}'.format(genotypes.sum()))
[ ]:
fnodes_df, fedges_df = select_genotypes(nodes_df, genotypes=genotypes, edges=edges_df)
[ ]:
plot.plot_interactive(fnodes_df, #edges_df=fedges_df,
                      z='3', nodes_color='function', nodes_size=4,
                      fpath='pard.codon', text=fnodes_df['protein'])
[ ]:
nodes_df['dna'] = nodes_df.index
nodes_df.index = nodes_df['protein']
nodes_df.shape
[ ]:
plot.figure_allele_grid_datashader(nodes_df, fpath='parE2.codon.alleles', edges_df=edges_df, x='1', y='2',
                                   nodes_resolution=1000)
[ ]: